feat(prometheus): expose controlplane connectivity state as a gauge #14020
base: master
kong/clustering/data_plane.lua
Outdated
local function set_control_plane_connected(reachable, ttl)
  local ok, err = ngx.shared.kong:safe_set("control_plane_connected", reachable, ttl)
I think we should just hard-code the ttl as PING_WAIT
since it's what we should use in all cases.
- local function set_control_plane_connected(reachable, ttl)
-   local ok, err = ngx.shared.kong:safe_set("control_plane_connected", reachable, ttl)
+ local function set_control_plane_connected(reachable)
+   local ok, err = ngx.shared.kong:safe_set("control_plane_connected", reachable, PING_WAIT)
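To illustrate why a fixed TTL works here, the following is a minimal sketch that runs in plain Lua. `fake_shm`, the hand-advanced clock `now`, and the numeric value of `PING_WAIT` are all assumptions made for illustration; only `safe_set(key, value, exptime)` and `get(key)` mirror the real `ngx.shared.DICT` calls. The point is that a flag that is not refreshed within `PING_WAIT` seconds simply disappears, so a stalled connection reads as not connected without any extra bookkeeping.

```lua
-- Stand-in for an ngx.shared dict so the sketch runs outside OpenResty.
-- `now` is a fake clock (in seconds) that the example advances by hand.
local now = 0

local fake_shm = { data = {} }

function fake_shm:safe_set(key, value, ttl)
  -- As with ngx.shared.DICT, a missing or zero ttl means "never expires".
  local expires = (ttl and ttl > 0) and (now + ttl) or nil
  self.data[key] = { value = value, expires = expires }
  return true
end

function fake_shm:get(key)
  local entry = self.data[key]
  if not entry then
    return nil
  end
  if entry.expires and now >= entry.expires then
    self.data[key] = nil
    return nil
  end
  return entry.value
end

-- Illustrative value only; the real constant is defined in Kong's clustering code.
local PING_WAIT = 45

-- Same shape as the suggested function: the caller no longer passes a ttl.
local function set_control_plane_connected(reachable)
  local ok, err = fake_shm:safe_set("control_plane_connected", reachable, PING_WAIT)
  if not ok then
    print("failed to set \"control_plane_connected\" key in shm to ", reachable, ": ", err)
  end
  return ok
end

set_control_plane_connected(true)
print(fake_shm:get("control_plane_connected"))  -- true while the key is fresh
now = now + PING_WAIT                            -- no refresh for PING_WAIT seconds
print(fake_shm:get("control_plane_connected"))  -- nil: the key has expired
```

With the TTL fixed inside the function, every call site gets the same expiry behavior and cannot accidentally pass a different value.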
kong/clustering/data_plane.lua
Outdated
local function set_control_plane_connected(reachable, ttl)
  local ok, err = ngx.shared.kong:safe_set("control_plane_connected", reachable, ttl)
  if not ok then
    ngx_log(ngx_ERR, _log_prefix, "failed to set controlplane_reachable key in shm to ", reachable, " :", err)
nit: update the log line to match the current name of the SHM key
-     ngx_log(ngx_ERR, _log_prefix, "failed to set controlplane_reachable key in shm to ", reachable, " :", err)
+     ngx_log(ngx_ERR, _log_prefix, "failed to set \"control_plane_connected\" key in shm to ", reachable, " :", err)
kong/plugins/prometheus/exporter.lua
Outdated
metrics.cp_connected = prometheus:gauge("control_plane_connected",
"Kong connected to control plane, " ..
"0 is unconnected",
nil,
prometheus.LOCAL_STORAGE)
nit: formatting
- metrics.cp_connected = prometheus:gauge("control_plane_connected",
- "Kong connected to control plane, " ..
- "0 is unconnected",
- nil,
- prometheus.LOCAL_STORAGE)
+ metrics.cp_connected = prometheus:gauge("control_plane_connected",
+ "Kong connected to control plane, " ..
+ "0 is unconnected",
+ nil,
+ prometheus.LOCAL_STORAGE)
-- it takes some time for the cp<->dp connection to get established and the
-- metric to reflect that, so set the timeout to 10 secs.
assert.with_timeout(10).eventually(function()
Suggesting a more lenient timeout here to prevent the test from being flaky. It's not uncommon for stuff to be extra slow in CI.
- -- it takes some time for the cp<->dp connection to get established and the
- -- metric to reflect that, so set the timeout to 10 secs.
- assert.with_timeout(10).eventually(function()
+ -- it takes some time for the cp<->dp connection to get established and the
+ -- metric to reflect that. On failure, re-connection attempts are spaced out
+ -- in `math.random(5, 10)` second intervals, so a generous timeout is used
+ -- in case we get unlucky and have to wait multiple retry cycles
+ assert.with_timeout(30).eventually(function()
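The arithmetic behind the longer timeout can be sketched as follows. The helper name `guaranteed_retries` is hypothetical, and the sketch only accounts for the `math.random(5, 10)` spacing between attempts; it ignores however long each connection attempt itself takes, so the real margin is somewhat smaller.

```lua
-- Upper bound of the retry spacing taken from `math.random(5, 10)`.
local MAX_DELAY = 10

-- Number of retry delays that are guaranteed to fit inside `timeout` seconds
-- even if every delay comes out at the maximum.
local function guaranteed_retries(timeout)
  return math.floor(timeout / MAX_DELAY)
end

print(guaranteed_retries(10))  -- 1
print(guaranteed_retries(30))  -- 3
```

Under these assumptions a 10 second timeout leaves room for a single worst-case retry, while 30 seconds leaves room for three.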
Add a new Prometheus gauge metric `control_plane_connected`. Similar to the `datastore_reachable` gauge, 0 means the connection is not healthy; 1 means that the connection is healthy. We mark the connection as unhealthy under the following circumstances:
* Failure while establishing a websocket connection
* Failure while sending basic information to controlplane
* Failure while sending ping to controlplane
* Failure while receiving a packet from the websocket connection

This is helpful for users running a significant number of gateways to be alerted about potential issues any gateway(s) may be facing while talking to the controlplane.

Signed-off-by: Sanskar Jaiswal <[email protected]>
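The mapping from the four failure cases in the commit message to the gauge value can be sketched in plain Lua as below. This is not the PR's code: the event names, `on_event`, `gauge_value`, and the `"healthy"` event that sets the flag back to true are all hypothetical, and the local `connected` variable stands in for the shared-memory flag.

```lua
-- Hypothetical labels for the four failure cases from the commit message.
local FAILURES = {
  connect_failed    = true,  -- establishing the websocket connection failed
  basic_info_failed = true,  -- sending basic information to the control plane failed
  ping_failed       = true,  -- sending a ping to the control plane failed
  receive_failed    = true,  -- receiving a packet from the websocket failed
}

-- Stands in for the shared-memory flag; starts out as not connected.
local connected = false

local function on_event(name)
  if FAILURES[name] then
    connected = false
  elseif name == "healthy" then  -- assumption: some success path sets the flag
    connected = true
  end
  -- any other event leaves the flag unchanged
end

-- Value an exporter would report: 1 when healthy, 0 otherwise.
local function gauge_value()
  return connected and 1 or 0
end

on_event("healthy")
print(gauge_value())  -- 1
on_event("ping_failed")
print(gauge_value())  -- 0
```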
Summary
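Add a new Prometheus gauge metric `control_plane_connected`. Similar to the `datastore_reachable` gauge, 0 means the connection is not healthy; 1 means that the connection is healthy. We mark the connection as unhealthy under the following circumstances:
* Failure while establishing a websocket connection
* Failure while sending basic information to controlplane
* Failure while sending ping to controlplane
* Failure while receiving a packet from the websocket connection

This is helpful for users running a significant number of gateways to be alerted about potential issues any gateway(s) may be facing while talking to the controlplane.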
Checklist
changelog/unreleased/kong or skip-changelog label added on PR if changelog is unnecessary. README.md
Issue reference
Fix #[issue number]